Estimation of cosmological parameters using adaptive importance sampling 
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We present a Bayesian sampUng algorithm called adaptive importance sampling or Population 
Monte Carlo (PMC), whose computational workload is easily parallelizable and thus has the po- 
tential to considerably reduce the wall-clock time required for sampling, along with providing other 
benefits. To assess the performance of the approach for cosmological problems, we use simulated and 
actual data consisting of CMB anisotropies, supernovae of type la, and weak cosmological lensing, 
and provide a comparison of results to those obtained using state-of-the-art Markov Chain Monte 
Carlo (MCMC). For both types of data sets, we find comparable parameter estimates for PMC and 
MCMC, with the advantage of a significantly lower computational time for PMC. In the case of 
WMAP5 data, for example, the wall-clock time reduces from several days for MCMC to a few hours 
using PMC on a cluster of processors. Other benefits of the PMC approach, along with potential 
difficulties in using the approach, are analysed and discussed. 
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I. INTRODUCTION 

In recent years we have seen spectacular advances in 
observational cosmology, with the availability of more 
and more high quality data allowing for the testing of 
models with higher complexity. Some of these tests have 
been made possible thanks to the use of Bayesian sam- 
pling techniques, and in particular Markov Chain Monte 
Carlo (MCMC) - an (iterative) algorithm that produces 
a Markov chain whose distribution converges to the tar- 
get posterior tt. After a "burn-in" period, samples from 
such a chain can be regarded as samples approximately 
from TT. Proposed values for the chain or the updating 
scheme of MCMC can be designed to ensure that moves 
towards regions of higher mass under tt are favored, and 
regions with null probability (under tt) are never visited. 
This way, most of the computational effort can be spent 
in the region of importance to the posterior distribution, 
and an MCMC approach is usually much more efficient 
than traditional grid sampling of the model parameter 
space. 

The MCMC technique is now well known in cosmology, 
and in particular in its most simple form, the Metropolis- 
Hastings algorithm, thanks to the user-friendly and 
freely available package COSMOMC ^. Other forms of 
the MCMC algorithm, like Gibbs-sampling and Hybrid 
Monte Carlo (better known in cosmology as Hamiltonian 
sampling) , have also been proposed and have found some 
interesting usage in the estimation of the posterior distri- 
bution for the Cosmic Microwave Background anisotropy 
power spectrum at low resolution (see [2| and references 
therein, also i and i). 

For all its advantages over grid sampling, the MCMC 
approach also sulfers from problems. One difficulty is to 
assess the correct convergence of the chain. Another lies 
in the presence of correlations within the chain which 



can greatly reduce the efficiency of the sample 01 • A 
third issue which is particularly relevant for the usage of 
MCMC in cosmology is the computational time involved. 
Indeed, whatever the sampling technique, we often need 
to compute at least one estimate of the posterior for each 
sampled point. This computation can be slow in cosmol- 
ogy. With the current processing speed of computers, a 
point of the posterior of, for example, the WMAP5 data 
set, using CAMB 01 and the public WMAP5 likehhood 
code Q^, both with their default precision settings, is 
computed at the order of several seconds, and can be 
much slower when exploring non-flat models. Of course, 
as stated above, for most problems MCMC will require 
orders of magnitude less samples than a grid for a given 
target precision, thus providing an important efficiency 
improvement. However, apart from improving the likeli- 
hood codes or waiting for the availability of faster com- 
puters there is not much speed improvement to expect 
from an MCMC approach, while probably needed if one 
wants to explore yet bigger and more complex models. 
On the algorithmic side of the problem, some effort has 
been devoted recently to the improvement of the like- 
lihood codes, mainly by using clever interpolation tricks 
(segmentation 01, neural networks Q) and by looking for 
improvements in the MCMC algorithm [1, [IH. The 
former 0, Q indeed provide some gain in efficiency, but 
at the cost of a long pre-computation step for each model. 
The latter improves on the natural inefficiency of the 
Metropolis Algorithm but imposes some other require- 
ments, like the availability of cheap computation of the 
derivatives of the likelihood @ , or the knowledge of con- 



^ http://camb.info 

^ http://lambda.gsfc.nasa.gov 
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ditional probabilities of some of the parameters [lO|, [ill . 
Other (non Markovian) Monte Carlo methods, such as 
nested sampling, have also been proposed and applied re- 
cently to cosmological problems with some success along 
with presenting their own problems 

On the hardware side, however, there is a route to 
speed improvement that does not lie in quicker CPUs, 
but on the availability of cheap multi-CPU computers 
and the standardization of clusters of computers. This 
opportunity, however, is only partly opened to MCMC. 
Indeed, there arc two ways of parallelizing the parame- 
ter exploration. First, by distributing the computation 
of the likelihood, which is not always possible and docs 
not always lead to speed improvement. Second, by run- 
ning multiple chains in parallel. This last option is the 
simplest, but is 'forbidden' by the iterative nature of 
the MCMC algorithm. More precisely, running paral- 
lel chains and mixing them in the end to build a bigger 
chain sample is of course possible (and can be advanta- 
geous in fully exploring the support of tt), but at the con- 
dition that each of the individual chains has converged. 
In the absence of such a condition, significant biases in 
the sample can be introduced. Determining convergence 
for each chain is inherently difficult in practice and has 
larg ely prevented more widespread use of the approach 
[15|. Thus, for MCMC any speed improvements through 
parallelization are difficult to achieve. 

In this paper, wc propose another sampling algorithm 
suitable for cosmological applications, that is not based 
upon MCMC, and can be parallelised. This novel algo- 
rithm, called Population Monte Carlo (PMC) is an adap- 
tive importance sampling technique, that has been stud- 
ied recently in the statistics literature [l^. While this 
algorithm solves some of the issues of MCMC in cosmol- 
ogy, the approach of course has a different set of potential 
problems that wc will analyse and discuss, along with its 
advantages. 

The paper is outlined as follows. In the next section, 
we provide a brief introduction to the Bayesian approach, 
which we hope will give the casual reader some important 
keys for further readings, and we also discuss the chal- 
lenges and issues involved with using either an MCMC 
or an importance sampling algorithm for estimation. We 
then describe details of the PMC approach. In Sect. IIIll 
we assess the performance of the PMC approach using a 
simulated target density with features similar to cosmo- 
logical parameter posteriors, and provide a comparison to 
results obtained using an MCMC approach. In Sect. llVi 
we illustrate the results from the PMC approach using 
actual data, consisting of CMB anisotropics, supernovae 
of type la and weak cosmological Icnsing. We conclude 
in Sect. |V] with a discussion and an outlook for further 
work. 



II. METHODS 

A. Bayesian inference via simulation 

A key feature of Bayesian inference is to provide a 
probabilistic expression for the uncertainty regarding a 
parameter of interest x by combining prior information 
along with information brought by the data. Prior infor- 
mation, for example, could take the form of information 
obtained from previous experiments which cannot read- 
ily be incorporated into the current experiment or simply 
consist of a feasible range. The absence of prior informa- 
tion, however, is not restriction for the use of Bayesian in- 
ference and estimation can still be regarded as valid [l3l ■ 
Information brought by the data and prior information 
arc entirely subsummcd in the posterior probability den- 
sity function obtained, up to a normalization constant, 
by 

Ti{x) oc likelihood(data jcc) x prior(a;). (1) 

It is however generally difficult to handle the posterior 
distribution, due to (a) the dimension of the parame- 
ter vector X, and (b) the use of non-analytical likelihood 
functions. For both of these reasons, the normalizing 
constant missing from the right-hand side of (H]) is usu- 
ally not explicitly available. A practical solution to this 
difficulty is to replace the analytical study of the poste- 
rior distribution with a simulation from this distribution, 
since producing a sample from tt allows for a straightfor- 
ward approximation of all integrals related with tt, due 
to the Monte Carlo principle @. In short, if xi, . . . , xn is 
a sample drawn from the distribution tt and / denotes a 
function (with finite expectation imder tt), the empirical 
average 

^ E (2) 

n=l 

is a convergent estimator of the integral 

^(/)- / f{x)A^)dx, (3) 

in the sense that the empirical mean ([2]) converges to 
7r(/) as N grows to infinity. Quantities of interest in a 
Bayesian analysis typically include the posterior mean, 
for which f{x) — x] the posterior covariance matrix cor- 
responding to f{x) = xx"^; and probability intervals, 
with f{x) ~ Is (a;), where 5 is a domain of interest, and 
l5(x) denotes the indicator function which is equal to 
one if X & S and zero otherwise. 



B. Markov chain Monte Carlo methods 

For most problems in practice, direct simulation from 
TT is not an option and more sophisticated approximation 
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techniques are necessary. One of the standard approaches 
to the simulation of complex distributions is the class 
of Markov chain Monte Carlo (MCMC) methods that 
rely on the production of a Markov chain {a;„} having the 
target posterior distribution tt as limiting distribution. 

MCMC can be implemented with many Markovian 
proposal distributions but the standard approach is the 
random walk Metropolis-Hastings algorithm: given the 
current value Xn of the chain, a new value x* is drawn 
from ip(x — Xn), where the so-called proposal ip denotes 
a symmetric probability density function. The point a;^ 
is then accepted as Xn+i with probability (also called 
acceptance rate in this context) 



min < 1 



(4) 



and otherwise, Xn+i 
mented as follows: 



' Tr{Xn) J ' 

„. The algorithm is implc- 



Random walk Metropolis-Hastings algorithm 

Do: Choose an arbitrary value of xi. 
For n > 1: 

Generate x^, ^ ^{x — Xn) and u ^ Uniform(0, 1). 
Take 



Xn+l 



x-f, if u < 7r(a:*)/7r(x„), 
Xn Otherwise. 



While this algorithm is universal in that it applies to 
any choice of posterior distribution tt and proposal -0, 
its performance highly depends on the choice of the pro- 
posal if) that has to be properly tuned to match some 
characteristics of tt. If the scale of the proposal ij} is too 
small, that is, if it takes many steps of the random walk 
to explore the support of tt, the algorithm will require 
many iterations to converge and, in the most extreme 
cases, will fail to converge in the sense that it will miss 
some relevant part of the support of tt [31 . If, on the 
other hand, the scale of ^ is too large, the algorithm 
may also fail to adequately sample from tt. This time, 
the chain may exhibit low acceptance rates and fail to 
generate a sufficiently diverse sample, even with longer 
runs. There exist monitors that assess the convergence of 
such algorithms but they usually are conservative - i.e., 
require a multiple of the number of necessary iterations 
- and partial - i.e., only focus on a particular aspect of 
convergence or on a special class of targets Q. MCMC 
algorithms are also notoriously delicate to calibrate on- 
line, both from a theoretical point of view and from a 
practical perspective [l^. For these approaches, often 
called adaptive MCMC, some recommendations for the 
optimal scaling and calibration schedule for various pro- 
posals in high dimensions have been proposed (20| . but 
this is still at an experimental stage. 



C. Population Monte Carlo 

Population Monte Carlo (PMC) J[lEj2l| is an adaptive 
version of importance sampling |22l. |23|| that produces 
a sequence of samples (or populations) that are used in 
a sequential manner to construct improved importance 
functions and improved estimations of the quantities of 
interest. 

We recall that importance sampling is based on the 
fundamental identity Q 



T^U) = / f{x)Tl{x) dx 



f{x)^q{x)dx, (5) 



which holds for any probability density function q with 
support including the support of tt and any function / 
for which the expectation 7r(/) is finite. Hence, this ap- 
proach to approximating integrals linked with complex 
distributions is also universal in that the above identity 
always holds. If xi, . . . , sat are drawn independently from 

q, 

1 ^ 

7r(/) = ]^ X! f(^ri)Wn; Wn = n{Xn)/q{Xn), (6) 
n=l 

provides a converging approximation to 7r(/) . In this con- 
text, q is called the importance function and w„ are com- 
monly referred to as importance weights. For Bayesian 
inference, one cannot directly use ([6]) as only the unnor- 
malised version of tt (i.e., the right-hand side of eq. [1]) is 
available. Conveniently, the self-normalised importance 
ratio 



'^n(/) 



N 

E 



f{Xn)Wn 



(7) 



where the normalised importance weights are defined as 



(8) 



E 



N 



Wn 



is also a converging approximation to 7r(/), independent 
of the normalization of tt. For an importance function 
that is closely matched to the target density, significant 
reductions in the variance of the Monte Carlo estimates 
are possible in comparison to estimates obtained using 
MCMC [Hi- However, the importance sampling approach 
is equally prone to poor performances as MCMC, in that 
the resulting converging approximation may suffer from a 
large or even infinite variance if q is not selected in accor- 
dance with TT. There is no universal importance function 
and most of the research in this field aims at fitting the 
most efficient importance functions for the problem at 
hand. 

Population Monte Carlo offers a possible solution to 
this difficulty through adaptivity: given the target pos- 
terior density tt up to a constant, PMC produces a se- 
quence q* of importance functions {t = 1, . . . ,T) aimed 
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at approximating this very target. The first sample 
is produced by a regular importance sampling scheme, 



q , associated with importance weights 



1,. 



(9) 



and their normalised counterparts w^^ (cq. [8]), providing 
a first approximation to a sample from tt. Moments of 
TT can then be approximated to construct an updated 
importance function g^, etc. 

The approximation can be measured in terms of the 
KuUback divergence (also called KuUback-Leibler diver- 
gence or relative entropy) from the target, 



log 



Tr{x) 
q\x) 



(10) 



and the density can be adjusted incrementally such 
that if(7r||g*) is smaller than K{Ti\\q*~^). The impor- 
tance function should be selected from a family of func- 
tions which is sufficiently large to allow for a close match 
with TT but for which the minimization of pop is com- 
putationally feasible. In [T6| the authors propose to use 
mixture densities of the form 



D 



Q*(x)=<z(x;a*,0*)=^a^^(x;0*) 



(11) 



where a* = (a* , . . . , a^) is a vector of adaptable weights 
for the D mixture components (with > and 

Yl!d=i '^d ~ 1)' ^^^'^ ^* ~ ((?*,..., 6*^) is a vector of param- 
eters which specify the components; i^j is a parameteriscd 
probability density function, usually taken to be multi- 
variate Gaussian or Student-t (where the latter is to be 
preferred in cases where it is suspected that the tails of 
the posterior tt are indeed heavier than Gaussian tails). 
Given the vast array of densities that can be approx- 
imated by mixtures, such an importance function pro- 
vides considerable fiexibility to efficiently estimate a wide 
range of posteriors, including in this case those found in 
cosmological settings. 

The generic PMC algorithm then consists of the 
following: 

Population Monte Carlo algorithm 



Do: Choose an importance function q^. 
Generate an independent sample x\, . 
Compute the importance weights w}, . 



For t > 1: 

Update the importance function to g*"*"^, based on the 
previous weighted sample {x{,w[), . . . , (a;^, w%). 



t+i 



Generate independently x-^ 
Compute the importance weights w\~^^ , 



7 -^N 



Jn ■ 



Unlike for MCMC, in a PMC approach, the process 
can be interrupted at any time as the sample produced at 
each iteration can be validly used to approximate expec- 
tations under tt using self-normalised importance sam- 
pling following ([7|). Further, samp ling outputs from pre- 
vious iterations can be combined [24l [25l| . and the sam- 
ple size at each iteration does not necessarily need to be 
fixed. Both of these properties of PMC can be exploited 
to improve parameter estimates, either by increasing the 
coverage of the importance function to the target density 
or increasing the precision of the approximation for the 
integral of interest. 

Also note that an approximate sample from the target 



density can be obtained by sampling 



,<) with 



replacement, using the normalised importance weights 
w^. Although this process induces extra Monte Carlo 
variation, there are a number of methods available which 
considerably reduce the variation involved (e.g. residual 
sampling [2y| or systematic sampling [27i|). 



1. Updating ttie importance function in tfie Gaussian case 

In this section, we particularise the generic PMC algo- 
rithm to the case where the importance function consists 
of a mixture of p-dimensional Gaussian densities with 
mean and covariance S^, 

(p(x;//d,I]rf) = (27r)-P/2|I]<i|-i/2 

1 



X exp 



{x ~ t^df^d - t^d) 



(12) 



Using this importance function for the mixture model 
(fTTj) . we start the PMC algorithm by arbitrarily fixing 
the mixture parameters (a^, E^), and then sample in- 
dependently from the resulting importance function q^ to 
obtain our initial sample [x\, . . . , sjy)- From this stage, 
updates of the parameters proceed recursively. 

At iteration the importance weights associated with 
the sample (x* , . . . , x^) are given by 



7^(4) 



(13) 



with normalised counterparts uj*^ given by eq. ([5]). The 
parameters (a*,/!* and S*) of the importance function 
are then updated according to 



N 



t+1 _ En^i ^nx\ Pdjxj; a\fj.\Y.*) 
f-d ^ „t+l 



a 



(14) 
(15) 



Ell < (4 - Pd^')i< - Pd'-'Vpdixi; a\ /i*, S*) 



(16) 
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where 



ad ipjx; fid, Sd) 
T,d=i^df{x;fid,^d) 



(17) 



Appendix |^ provides derivations of these expressions 
and further details on the general approach, as well as 
equations pertaining to the (more involved) case of mix- 
tures of multivariate Student-t distributions, which are 
used in the simulations presented in Sect. Hill 

As discussed in Appendix El the main theoretical 
appeal of this particular update rule is that, as N 
tends to infinity, the corresponding KuUback divergence 
K{TT\\q*^^) is guaranteed to be less than K{Tr\\q*). 



2. Monitoring convergence 

The above update process can be repeated a number 
of times, and although there is no need for a formal 
stopping rule, some measures of performance against the 
target density can be used as a guide. As the objec- 
tive of importance function adaptations is to minimise 
the KuUback divergence between the target density and 
the importance function, we can stop the process when 
further adaptations do not result in significant improve- 
ments in K{TT\\q*). To this end, it can be shown that 
exp[— ii'(7r||(7*)] may be estimated by the normalised per- 
plexity 



p = eMH^)/N, 



where 



N 



logw* 



(18) 



(19) 



is the Shannon entropy of the normalised weights, a fre- 
quently used measure of the quality of an importance 
sample. Thus, minimization of the KuUback divergence 
can be approximately connected with the maximization 
of the perplexity p^ . Values of this criterion close to 1 
will therefore indicate good agreement between the im- 
portance function and the target density. 

Another frequently used criterion for importance sam- 
pling is the so-called effective sample size (ESS), 



N 



essIt 



(20) 



which lies in the interval [1; N] and can be interpreted as 
the number of sample points with non-zero weight [28j . 
Both measures (flSl [20| are interconnected, as an impor- 
tance function which is close to the target density will 
have both a high normalised perplexity and a relatively 
large number of points with non-zero weight, compared 
to an ill-fitting importance function. Given a real-valued 
function / of interest one can also estimate the asymp- 
totic variance of the self-normalised importance sampling 




FIG. 1: Test target density on the {x\,X2) plane. Contours 
represent the 68.3% (blue), 95% (green) and 99% (black) con- 
fidence regions. 



N 



estimator 7r^(/) = 
importance sample itself, as 



ifi^n) (cf- eq. using the 



N 



^EK (/«) 



(21) 



Beware that this formula (which is derived from theo- 
rem 2 of [2^ ) is only valid with normalised weights, and 
that it is a variance conditional on the current impor- 
tance proposal (7*, i.e. it does not take into account the 
adaptation. This measure can be related to the so-called 
integrated autocorrelation time (I AT) used for Markov 
chain Monte Carlo simulations, which, in this case, takes 
into consideration the level of autocorrelation present in 
the chain [llHi^. 



3. A first illustration 

To illustrate the PMC approach, we explore a banana- 
like target density presented Fig. [H The same target 
distribution will be studied in greater depth in the next 
section. The results of the first eleven (11) iterations of 
the PMC algorithm using a mixture of Student-t densities 
are shown Fig. [51 (see also Appendix \^ for details of the 
update procedure). 

While this target density shows slightly more pro- 
nounced curvature for an example of a posterior density 
in practice, it serves to illustrate the process of adapta- 
tion of the importance function. The initial importance 
function is a mixture of multivariate Studcnt-t's, con- 
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sisting of nine components placed around the centre of 
the range for the first two variables, each with a rela- 
tively large variance (for the first dimension = 200, sec- 
ond = 50) and degrees of freedom = 9. The different 
coloured circles in Fig. [1] indicate the location of the com- 
ponent means, and the circle size is proportional to the 
weight ad associated with the component. At the fourth 
iteration (t = 4), we see that the importance function 
starts to resemble the shape of the target density, with 
components becoming more separated and moving into 
the tails of the target. By the sixth iteration {t = 6) the 
importance function has further adapted to the shape 
of the target banana density. For this target density and 
importance function. Fig. [3] presents estimates of the nor- 
malised perplexity and normalised effective sample size 
(ESS/iV) for the first 10 iterations over 500 simulation 
runs. As shown, the estimates of the normalised perplex- 
ity improve rapidly from approximately 0.14 for the sec- 
ond importance function (Iteration 2) to approximately 
0.81 for the last importance function (Iteration 10), with 
a similar increase in estimates of the normalised effec- 
tive sample size (ESS/TV, increasing from 0.10 to 0.60). 
For this importance function and target density, the nor- 
malised perplexity starts to level off after the 10th iter- 
ation (around 0.82), indicating that there is no need for 
further adaptation of the proposal density. As mentioned 
previously however, in general, one does not need to ob- 
serve the convergence of the proposal (as for MCMC) in 
order to stop the sampling process. 
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FIG. 2: Evolution of the importance function for the target 
density (see Fig. [1} over 11 iterations of 10k points for xi 
(horizontal axis) and X2 (vertical axis), except for the last 
iteration (11) which is a sample of 100k points. Iterations 1 
(top left) to 11 (bottom right) from left to right with every 
second iteration shown (i.e 1,3,5,7,9,11). Colours indicate the 
mixture components with mean of each component indicated 
by coloured dots and approximate 95% confidence regions for 
the sample of points from each component by coloured el- 
lipses. Every 3rd sample point from the importance function 
is plotted. 



III. SIMULATIONS 



An important consideration, and a choice that needs to 
be made at the start of the algorithm are the parameter 
values a^, /i^ and for the initial importance function 
q^, including the degrees of freedom i/ in the case of the 
Student-t mixture, and the sample size N. A poor initial 
importance function, such as one that is tightly centred 
around only one mode in the case of a multimodal poste- 
rior or a narrow importance function with light tails, may 
take a long time to adapt or may miss important parts 
of the posterior. For importance sampling the choice of q 
requires both fat tails and a reasonable match between q 
and the target tt in regions of high density. Such an im- 
portance function can be more easily constructed in the 
presence of a well informed guess about the parameters 
and possibly the shape of the posterior density. Sample 
size considerations also play an important role - smaller 
samples can adapt quite quickly with less computational 
time but may provide less reliable information about the 
posterior density relative to larger samples. Such consid- 
erations are important as we look at posterior densities 
of increasingly high dimensions, and thus we can expect 
to take a larger sample size as the dimension of the prob- 
lem increases. Wc will discuss these issues further in the 
context of simulated and actual data, and also in Sect.fVl 



In this section, we test the performance of PMC using 
simulated data, and compare the results to an adaptive 
MCMC procedure. 

1. Target density 

In order to provide a good test for both approaches 
we use the target density considered in , which is dif- 
ficult to explore but which also provides a realistic sce- 
nario for many problems encountered in cosmology. The 
target density is based on a centred p-multivariate nor- 
mal, X A/^(0, E) with covariance E = diag(crj, !,...,!), 
which is slightly twisted by changing the second co- 
ordinate X2 to X2 + b{xi — a\). Other co-ordinates are 
unchanged. We obtain a twisted density which is cen- 
tered with uncorrelated components. Since the Jacobian 
of twist is equal to 1, the target density is: 

{xi,X2 + b{x\ - af), X3, . . . , Xp) - Np{Q, S) . (22) 

For the target density that we will consider, we set p = 
10, al = 100, b = 0.03, which results in a banana-shaped 
density in the first two dimensions (sec Fig. [1]). 
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FIG. 3: Normalised perplexity (top panel) and normalised 
effective sample size (ESS/N) (bottom panel) estimates for 
the first 10 iterations of PMC (represented in Fig. [2]) over 
500 simulation runs. The distributions are shown as whisker 
plots: the thick horizontal line represents the median; the 
box shows the interquartile range (IQR), containing 50% of 
the points; the whiskers indicate the interval 1.5xIQR from 
either Ql (lower) or Q3 (upper); points outside the interval 
[QljQS] (outliers) are represented as circles. 



For the target density considered, interest is in how 
well PMC and MCMC are able to approximate the tails 
of the target. Whilst the curvature present in the first 
two dimensions of this target density is slightly more pro- 
nounced than what is typically seen in practice for cos- 
mology it serves to highlight the difficulties faced by both 
PMC and MCMC in covering the parameter space. In 
particular, little accurate information is available in or- 
der to guide the choice of importance function (for PMC) 
and proposal distribution (for MCMC) and so both ap- 
proaches are forced to learn about the parameter space. 



2. Test run proposal for PMC 

For PMC, and in the absence of any detailed a priori 
information about the target density, except the possible 
range for each variable, we have chosen the first impor- 
tance function to be a mixture of multivariate Studcnt- 
t distributions with components displaced randomly in 
different directions slightly away from the centre of the 
range for each variable: the mean of the components 
is drawn from a p-multivariate Gaussian with mean 
and covariance equal to Eo/5 where Eq is some positive- 
definite matrix; the variance for components was cho- 
sen to be Eq. We choose a mixture of 9 components 
of Studcnt-t distribution with u — 9 degrees of free- 
dom; and Eq is a diagonal matrix with diagonal entries 



(200, 50, 4, ■ • • ,4). This choice of (j^, Eq) ensures ade- 
quate coverage, albeit somewhat overdispersed, of the 
feasible parameter region. In this simulated example, 
Studcnt-t distributions are preferable to Gaussian distri- 
butions because the range of the variables is unbounded 
(in contrast to the cosmology examples to be discussed 
in Scct.|lV|. 

A representation of the first importance function for 
the first two dimensions is shown in the top left-hand 
box in Fig. [21 with a typical evolution over the next few 
iterations in the other panels. In pilot runs of various 
importance functions against the target density, the best 
fitting importance function required at least seven com- 
ponents in order to adequately represent the coverage of 
the entire density. 

For PMC, an important issue is the sample size for 
each iteration. A poor initial importance function with 
a relatively small sample size will take a long time to 
adapt or it may even be unable to recover sufficiently to 
provide reasonable parameter estimates. Such problems 
are more likely to occur as the dimension of the parame- 
ter space increases, the so-called curse of dimensionality. 
For the simulation exercise each iteration is based on a 
sample of 10k points. To prevent numerical instabilities 
in the updating of the parameters, components with a 
very small weight (< 0.002) or containing less than 20 
sample points are discarded. 



3. Test run proposal for MCMC 

As little information is available for the target density, 
an adaptive MCMC approach is used which can allow for 
faster learning of the target density than using either in- 
dependent or non-adaptive random walk proposals [sij . 
For MCMC, the proposal distribution is a centred Gaus- 
sian with covariance matrix which is updated along the 
iterations. An important choice for adaptive MCMC con- 
cerns the scaling of the proposal and the rate of adapta- 
tion. There has been much research on this (Sf], |3^ , and 
a common choice for the covariance of the Gaussian pro- 
posal is to consider c E„ where E„ is an estimate of the 
covariance of the target density, at update n. The choice 
c = 2.38^/p is considered to be optimal when the chain 
is in its stationary regime, and for target densities that 
have a product form [3l| . However for the target density 
we consider this does not hold: the first two components 
are not independent despite being uncorrelated and de- 
pendence is not linear but quadratic. However, with no 
other theoretical results to follow we start with a scal- 
ing factor of that form and for the simulation results to 
follow assess the effect on convergence and results using 
alternative values. We update the covariance matrix by 
the recursive formula 



E,i — (1 — a„)E„_i -|- a„S'„ 



(23) 



where E„_i is the sample covariance of the previous up- 
date, and Sn is the covariance of the sampled estimates 
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from the previous update to the current iteration. The 
value of a„ is 1 jn^ with fc chosen suitably to allow for a 
cooling of the update, which is a necessary condition to 
ensure convergence of this adaptive MCMC to the tar- 
get density as well as convergence of the empirical aver- 
ages js^ l33l|. 

In pilot runs, we explored the effect of this schedule 
for various values of (fc, c) in (0, 1) x (0, 2.38^/p) and we 
observe that the choice of (/c, c) plays a role on the time 
to convergence (for the estimation of the quantities of 
interest, see below) and on the acceptance rate of the 
chain. 

To ensure a fair comparison with PMC, we start the 
chain at a random point drawn from the same Gaussian 
distribution as for PMC (i.e 7\/d(0, So/5), using the same 
values for Sq as used for PMC). We also explored in pilot 
runs the role of the initial value of the chain: despite 
it being known that MCMC is sensitive to the choice 
of the initial position of the chain - which has no real 
counterpart in PMC - this hasn't been found to have 
a major impact on performance (for reasonable choice of 
the initial value at least) in this particular study. We also 
fixed the update schedule to be every 10k points and we 
assessed the effect on the results from using less or more 
points before updating. For the simulation results to 
follow, (fc, c) has been set to (0.5, 2.38^/p) which ensures 
convergence after the burn-in period (see Sect. IIII A[) . 
and a mean acceptance rate at convergence of about 10%. 
The proposal distribution is updated for the entire length 
of the chain and is not stopped after the burn-in period. 



A. Test runs 

For PMC and the proposal outlined, the perplexity ap- 
peared to level off at around the 10th iteration, so for the 
results to follow for PMC we ran the PMC algorithm for 
10 iterations (10k points per iteration) and used a final 
draw of 100k points. To assess MCMC for the same num- 
ber of points we used a chain length of 200k points with 
a burn-in of 100k points. Results for both approaches at 
successive intervals before 200k points are also provided. 
To assess the performance of the approaches, each simu- 
lation was replicated 500 times. 



we provide the results for various functions / of interest: 



/a(x) 


= Xi 


/b(x) 


= X2 


/c(x) 


= l68.3(x) 


/d(x) 


= l95(x) 


/e(x) 


= l68.3(a;i,X2) 




= lg5(a;i,a;2) 


A(x) 


= l68. 3(2:1) 




= l95(a;i) 



We note here Ig as the indicator function for the g% re- 
gion, jh and /i are indicators only for the first dimension, 
while /e and jg are dealing with the first 2. In all cases, 
the remaining dimensions are marginalised over. 

Table [T] shows the results for estimation of 7r(/) for 
functions /a and {x\ and xi respectively). The re- 
sults provided show the mean and standard deviation of 
estimates calculated over 500 runs. Although the per- 
formance is quite similar for both methods, PMC does 
display a two-fold reduction in standard deviation com- 
pared to MCMC for both functions. A closer look at 
the results reveals that for 7r(/a) the empirical distribu- 
tions of the estimates (see Figure H]) are quite similar 
for both methods, except for the variance which is much 
reduced for PMC. For 7r(/f,), on the other hand, the em- 
pirical distribution of the estimates for PMC are quite 
skewed, resulting in a slight positive bias for the major- 
ity of the runs (second panel of Figure 2]). The difference 
between 7r(/a) and 7r(/h) can be explained from Figure [T] 
which shows that failure to visit sufficiently the down- 
ward low-probability tails does indeed imply a positively 
biased estimates for the mean of the second component. 
PMC does appear to be more sensitive to this issue than 
MCMC, despite the fact that the estimates for MCMC 
display a larger overall variability. 

Figure O provides the results for the confidence region 
coverage. To depict the variability of the data, the results 
are displayed by using whisker plots. The results from 
both PMC and MCMC against all of the performance 
measures arc similar, with both showing good coverage 
of the target density. The distribution of this estima- 
tor is again more skewed for PMC than it is for MCMC, 
particularly for the 95% regions in the bottom panel of 
Figure Nevertheless, the variability of the estimates 



B. Results of the simulations 

For the results of the simulations, we are interested in 
both the mean estimates of the parameters (in particu- 
lar for x\ and x-i) and also estimates of the confidence 
region coverage (68.3% and 95%) which will provide an 
indication of how well both approaches are covering the 
tails of the target density. For each run r = 1, . . . , 500, 



TABLE I: Results of the simulations for the 10-dimensional 
banana shaped target density over 500 runs for both PMC 
and MCMC 



PMC MCMC 




mean 


0.097 -0.028 




std 


0.218 0.536 




mean 


0.013 0.002 




std 


0.163 0.315 


acceptance 




0.11 


perplexity 




0.80 
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FIG. 4; Evolution of 7r(/a) (top panels) and 7r(/i,) (bottom 
panels) from 10k points to 100k points for both PMC (left 
panels) and MCMC (right panels). See Fig.|3]for details about 
the whisker plot representation. 
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FIG. 5: Results showing the distributions of the PMC and the 
MCMC estimates vr(/) for (top) / = fc, fe, fh and (bottom) 
/ — fd,fg,fi (in this order, left to right). All estimates are 
based on 500 simulation runs. See Fig. [3] for details about the 
whisker plot representation. 



the Monte Carlo estimates for PMC in comparison to 
MCMC. In particular, it is interesting to note that the 
variance of the estimates, either for Tr{fa) or 7r(/b) for 
100k posterior evaluations under MCMC are comparable 
to estimates obtained using PMC at the second iteration 
(20k points). 

Simulating from this target distribution is a challeng- 
ing problem for both methods. In particular, the use 
of a vague initial importance function in a multidimen- 
sional space represents a challenge to PMC and it has 
been observed that the importance function takes some 
time to properly adapt to the target density (about 10 
iterations). The choice of the initial importance function 
in PMC is more crucial than is the choice of the initial 
proposal distribution in adaptive MCMC. Although dif- 
ferent variations for updating the covariance matrix for 
the MCMC approach are possible we did not see a sig- 
nificant improvement in the results presented from using 
alternative covariance structures. For most of the simula- 
tion results, the proposal covariance matrix was observed 
to adapt relatively quickly to the true covariance matrix. 
Changes to the covariance structure considered included 
changes to the update frequency, the starting proposal 
So, the scaling of the proposal (value of c) and adap- 
tation of the covariance (value of k). Hence, the PMC 
approach may require more precise a priori knowledge of 
the target density than MCMC. 

In the next section, we apply the PMC approach to 
typical cosmological examples, and provide results in 
comparison to MCMC. 



IV. APPLICATION TO COSMOLOGY 

We apply our new adaptive importance sampling 
method to the posterior of cosmological parameters. 
Flat CDM models with either a cosmological constant 
(ACDM) or a constant dark-energy equation-of-state pa- 
rameter (wCDM) are explored and tested with recent ob- 
servational data of CMB anisotropies, supernovae type la 
and cosmic shear, as described in the next section. 

The three data sets and likelihood functions used here 
arc the same as in 
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the CMB measurements and like- 
lihood are based on the five-year WMAP data release 
[sst . the SNIa data set is the first-year SNLS survey [11], 
while the cosmic shear is from the CFHTLS-Wide third 
release [T0003,|33|- The results presented in the follow- 
ing sections can be compared to the MCMC analysis in 



obtained with PMC also is significantly reduced com- 
pared to MCMC. 

Figure |4] shows the evolution of the results for 7r(/a) 
and 7r(/f,) from 10k points to 100k points for both PMC 
(left panels) and MCMC (right panels). The results from 
Fig. m in general, highlight the reduction in variance of 



A. Data sets 
1. CMB 

To obtain theoretical predictions of the CMB temper- 
ature and polarization power- and cross-spectra we use 
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the publicly available package CAMB Q . The likelihood 
is calculated using the public WMAP5 code [J. 

The WMAP5 hkelihood takes as input the TT,TE,EE 
and BB theoretical power spectra calculated by CAMB, 
and returns a likelihood computed from a sum of differ- 
ent parts. It computes a pixel-based Gaussian likelihood 
based on template-cleaned maps [s^ and their associated 
inverse covariance matrices (see Page et al. [s^ for de- 
tails) at large angular scales [i < 32 for TT, £ < 23 for 
TE, EE and BB). At small angular scales, it computes 
an approximate likelihood based on pseudo-spectra and 
their associated covariance for TT and TE iO], based 
respectively on the (Q,V) and (V,W) channel pairs for 
TE and TT. 

In addition, the likelihood computation takes into ac- 
count analytic marginalisations on nuisance parameters 
such as the beam transfer function and point-sources un- 
certainties [13, El- We ignore corrections due to SZ and 
impose a larger (flat) prior on the Hubble constant. In- 
deed, CMB data alone exhibits a degeneracy between the 
Hubble constant and e.g. the cosmological constant [i^l 
which is removed by adding other cosmological probes. 

The acoustic oscillation peaks in the CMB anisotropy 
spectrum are a standard ruler at a redshift of about 
z = 1100. CMB therefore measures the angular diam- 
eter distance to that redshift which depends mainly on 
the total matter-energy density (Slm-l-rJdc) and weakly on 
the Hubble constant h. The overall anisotropy amplitude 
is determined by the large-scale normalization A^. The 
relative height of the peaks is sensitive to the baryonic 
and dark matter densities. On large scales, secondary 
anisotropics are generated at late times (z ^ 20) due to 
reionisation, which is parametrised by the optical depth 
T, and the integrated Sachs- Wolfe (ISW) effect, which is 
a probe of fide- 



2. SNIa 

The SNLS data set is described in detail in [3^. We 
use their results from the SNIa light-curve fits which, for 
each supernova, provides the rest-frame _B-band magni- 
tude m*g , the shape or stretch parameter s and the colour 
c. We use the standard likelihood analysis described in 
[U, adopted from 13(1. 

Under the assumption that supernovae of type la are 
standard candles we can fit the luminosity distance to the 
SNIa data. The luminosity distance is a function of Qra, 
ride and w. Three additional parameters are the univer- 
sal SNIa magnitude M and the linear response parame- 
ters to stretch and colour, a and /3, respectively. Those 
three parameters are specific to our choice of distance 
estimator, and can be regarded as nuisance parameters. 
The Hubble constant h is integrated into the parameter 
M, so there is no explicit dependence on h in the SNIa 
posterior. 



3. Cosmic shear 

The CFHTLS-Wide 3'''^ year data release (T0003), the 
data and weak Icnsin g an alysis as well as cosmological re- 
sults are described in [37|| . As in [s^] we use the aperture- 
mass dispersion between 2 and 230 arc minutes as second- 
order lensing observable [ist . We assume a multivari- 
ate Gaussian likelihood function and take into account 
the correlation between angular scales. The theoretical 
aperture-mass dispersion is obtained by non-linear mod- 
els of the large-scale structure [44 [ . 

The galaxy redshift distribution is obtained by using 
the CFHTLS-Deep redshift distribution [i^ and rescal- 
ing it according to the iAB magnitude distribution of 
CFHTLS-Wide galaxies. We fit the resulting histogram 
with eq. (14) from [l^], introducing the three fit parame- 
ters a, b, c. The histogram data is modeled as multivari- 
ate, uncorrelated Gaussian, the corresponding likelihood 
is included, independent of the lensing likelihood, in the 
analysis. 

Weak gravitational lensing by the large-scale struc- 
ture is sensitive to the angular diameter distance and the 
amount of structure in the Universe. It is an important 
probe to measure the normalisation ag on small scales. 
With the current data, this parameter is however largely 
degenerate with fim. This degeneracy is likely to be lifted 
by future surveys which will include the measurement of 
higher-order statistics [i3,[43| and shear tomography [ist . 
In particular from the latter a great improvement on the 
determination of w is to be expected, a parameter which 
is only weakly constrained by lensing up to now [i^, . 



B. Cosmological parameter and priors 

We sample a hypcrcube in parameter space which cor- 
responds to flat priors for all parameters, sec Table |TT] 
for more details. Additional priors exist, both in explicit 
and implicit form, which represent regions of parameter 
space which are unphysical or where numerical fitting for- 
mulae break down. For example, we exclude extremely 
high baryon fractions (fib > O.TSOm) because of numeri- 
cal problems in the computation of the transfer function. 
Further, for very low values of both fl^ and as the pivot 
scale for the non-linear power spectrum is outside the 
allowed range. Very rarely, the calculation of the likeli- 
hood for individual points in parameter space is unsuc- 
cessful because of numerical errors or limitations of the 
likelihood code. Since these points cannot be taken into 
account, a pragmatic solution is to formally modify the 
prior to exclude those points. Note that these rare cases 
occur mainly in regions of very low likelihood. 



C. Initial choice of the importance function 



As described earlier in Sect. Ill C 3[ it is important to 
have a good guess for the initial importance function. In 
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TABLE 11: Parameters for tlie cosmology likelihood. C=CMB, S=SNIa, L=lensing. 



Symbol Description Minimum Maximum Experiment 

Th, Baryon density OM Ol C L 

Total matter density 0.01 1.2 C S L 

w Dark-energy eq. of state -3.0 0.5 C S L 

Tis Primordial spectral index 0.7 1.4 C L 

Afj. Normalization (large scales) C 

CT8 Normalization (small scales)" C L 

h Hubble constant C L 

T Optical depth C 

M Absolute SNIa magnitude S 

a Colour response S 

P Stretch response S 

a L 

b galaxy ^-distribution fit L 

c L 



"For WMAP5, erg is a deduced quantity that depends on the other 
parameters 



all cases considered here, we rely on an estimate of the 
maximum likehhood point and the Hessian at that point 
(Fisher matrix) to build our initial proposals. We use the 
conjugate-gradient approach [sij to find the maximum 
likelihood point at which to calculate the Fisher matrix 
F using the theoretical model. We construct a mixture 
model consisting of D Gaussian components. Student-t 
mixtures with small degrees of freedom were tested and 
turned out to be a bad approximation to the posterior 
under study, resulting in low perplexities. Each mixture 
component is shifted randomly from the maximum by a 
small amount. A random scaling is applied to the covari- 
ance of each component; i.e. the eigenvectors and ratios 
between the eigenvalues of the covariance are the same 
as the ones of the Fisher matrix. 

We obtain good results for shifts of about 0.5% to 2% 
of the box size. Here, a trade-off between too large shifts 
(resulting in low importance weights) and too small shifts 
(components stay near the maximum, the posterior tails 
do not get sampled) has to be found. The stretch fac- 
tor is chosen randomly between typical values of 1 and 
2. In some cases, in particular with high dimensionality, 
the derivation of the Fisher matrix is not stable and the 
matrix is numerically singular. In such cases we set the 
off-diagonal elements of F to zero. 

We found a sample size between 7 500 and 10 000 points 
to be adequate for most cases. The number of compo- 
nents D of the initial importance function was chosen 
between 5 to 10. For the final iteration we used a sample 
size five times that of the initial sample size. 



D. Results 

1. General performance 

The PMC algorithm is reliable and very efficient in 
sampling and exploring the parameter space. Both the 



perplexity as well as the effective sample size increase 
quickly with each iteration (Fig. [S]). The perplexity 
reaches values of 0.95 or more in many cases, although in 
particular for higher dimensional posteriors the final val- 
ues are lower. Satisfactory results (i.e. yielding consistent 
mean and marginals compared to MCMC, see below) are 
obtained for perplexities larger than about 0.6. 

The distribution of importance weights gets narrower 
from iteration to iteration (Fig. [7]). Initially, many sam- 
pled points exhibit very low weights. After a few iter- 
ations, the importance function has moved towards the 
posterior increasing the efficiency of the sampling. 
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FIG. 6: Perplexity (left panel) and normalised effective sam- 
ple size ESS/Af (right panel), as a function of the cumulative 
sample size N. The likelihood is WMAP5 for a flat ACDM 
model with six parameters. 

Our initial mixture model starts with all mixture com- 
ponents close to the maximum likelihood point. With 
consecutive iterations the components spread out to bet- 
ter cover the region where the posterior is significant. 
This can be seen in Fig. [S] 

Compared to traditional MCMC, our new PMC 
method is faster by orders of magnitude. The time- 
consuming calculation of the posterior can be performed 
in parallel and therefore a speed-up by a factor of the 
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FIG. 7: Histogram of the normalised importance weights iD^ 
for four iterations t — 0,3, 6, 9. The posterior is WMAP5, flat 
ACDM model with six parameters. 
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FIG. 8: Lower left panel: Overlaid to the SNIa confidence con- 
tours (68%, 95%, 99.7%) is the movement of the importance 
function. For each iteration a circle is plotted at the posi- 
tion of the mean of each component, where different colours 
indicate different components. The circle size indicates the 
component weight. The starting point (first iteration, at 
(0.3, —1.3)) is marked by a thick circle. The other two pan- 
els show the mean positions in projection, fanned out as a 
function of the iteration. 



number of CPUs is obtained. In times where clusters of 
multi-core processors are readily available, this speed-up 
is easily of the order of 100. In addition, MCMC has 
a low efficiency with typical acceptance rates of 0.25 - 
0.3. The PMC normalised effective sample size in the 
WMAP5 case is 0.7 which results in a much larger final 
sample for the same number of posterior calculations of 




FIG. 9: The sampled points from the final iteration are plot- 
ted, the colours indicate the components of the importance 
function from which the points are drawn (the colours are the 
same as in Fig. |8]). One out of five point is shown. Note that 
the density of points does not correspond to the posterior den- 
sity since the former has to be weighted by the importance 
weights. 



around 150 000. 

Wc emphasise again that with MCMC one can make 
only limited use of parallel computing since one has to 
wait for each Markov chain to converge, and because it 
is not straightforward to combine chains, as mentioned 
earlier. 



2. Comparison with MCMC 

The MCMC results we present here are either obtained 
using the adaptive MCMC algorithm or a classical one. 
Indeed, as we show in the following, adaptive MCMC 
can have some issues that a less efficient classical MCMC 
algorithm can avoid. Apart from those special cases, the 
MCMC and adaptive MCMC gave very similar results, 
the latter usually reaching a better acceptance rate, and 
thus a better efficiency. 

We find excellent agreement between using our respec- 
tive implementations of MCMC (adaptive or not) and 
PMC. Mean, confidence intervals and 2d-marginals are 
very similar using both methods. The performance of 
PMC is superior to MCMC in some cases, which is illus- 
trated by the following examples. 

An inherent problem of MCMC is that even for a long 
run there can be regions in parameter space that arc not 
sampled in an unbiased way. This is illustrated in Fig. 1101 
The feature at the 99.7%-level of MCMC (left panel, for 
large values of —M and a) is due to an "excursion" of the 
chain into a low-likelihood region at step 130 500, lasting 
for 300 steps. We ran the chain for 300 000 steps and the 
feature was still visible. A second run of the chain did not 
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FIG. 10: Examples of marginalised likelihoods (68%, 95% and 
99.7% contours are shown) for PMC (solid blue) and MCMC 
(dashed green) from the SNIa data. 



25 



20 



■g 15 

B 
o 

^ 10 



PMC 

adaptive MCMC 
MCMC no update 



I \ 
I \ 



0.02 0.04 0.06 0.08 0.10 



exhibit this anomaly. This kind of sample "noise" can be 
prevented by running a chain for a very long time or by 
combining several (converged) chains. Such features are 
much less likely to occur in an importance sample which 
consists of uncorrelated points. 

A second issue are parameters which are nearly uncon- 
strained by the data with the result that the marginalised 
posterior in that dimension is flat. To illustrate this we 
choose weak lensing alone which can not constrain fib 
(Fig. [TT|) . Using the Fisher matrix as initial Gaussian 
proposal for adaptive MCMC, the chain stays in a small 
region in the f2b-direction; the covariance being very flat, 
most jumps ends up out of the prior distribution. This 
results in an update variance for this parameter which is 
much too small, and in a bad exploration of the poste- 
rior in this flat direction as shown Fig. [TT] The classical 
MCMC algorithm, with the same proposal yields better 
results, but with a very low acceptance rate and needing 
500 000 steps to reach the result presented Fig.[TlJ Alter- 
natively, modifying the initial proposal to be smaller and 
better adapted to the prior, or increasing the covariance 
stretch factor from the optimal value of c = 2.38^/p (see 
Sect. nil "3t to c = helps the chain to explore more 

of the parameter space in the latter steps of the adap- 
tation. These modifications to the algorithm also result 
in very low acceptance rate, and somehow go against the 
very idea of an adaptive algorithm, since they require 
very fine tuning of the initial proposal! 

With PMC we obtain a much better performance and 
recover very well the flat posterior. 

In Tables Hill and [TVl we show the mean and 68% con- 
fidence intervals for CMB alone for the ACDM model 
and for lensing+SNIa-f CMB using wCDM, respectively. 
The differences in mean and 68%-confidence intervals 
is less than a few percent in most cases. Figure [TOl 
shows that the lower-confidence regions and the correla- 
tion between parameters agrees very well between (non- 
adaptive) MCMC and PMC. 



FIG. 11: Normalised Id marginals for Sib from weak lensing 
alone for PMC (solid blue lines) and MCMC (dashed green: 
adaptive, dotted red: non-adaptive). 



TABLE III: Parameter means and 68% confidence intervals 
for PMC and (non-adaptive) MCMC from the WMAP5 data. 



Parameter 


PMC 


MCMC 


!lb 


"•U^'i^'i-0. 00290 


O.OUlStVollll 
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n 9f;qQ+0.0340 
u.^uoo_Q 0282 


0.2626l°»i;^ 


r 


o.o878«:S;i^J 


0.0885l°:»i- 


Us 


0.9622t°:«l« 


0.9628l°«li^ 


lO^Al 


2 .0-1+0.118 




h 


n 711fi+0 0271 
U. (iiD_o.o261 


+0.0274 
U. / l^O_0.0268 



V. DISCUSSION 

In this paper, we have introduced and assessed an 
adaptive importance sampling approach, called Popula- 
tion Monte Carlo (PMC), which aims to overcome the 
main difficulty in using importance sampling, namely the 
reliance on a single efficient importance function. PMC 
achieves this goal by iteratively adapting the importance 
function towards the target density of interest. A signifi- 
cant appeal of the approach, when compared to alterna- 
tives such as MCMC, lies in the possibility to use (mas- 
sive) parallel sampling which considerably reduces the 
computational time involved in the estimation of param- 
eters for many astrophysical and cosmological problems. 
Simulated and actual data have been used in this work to 
assess the performance of PMC for estimation of param- 
eters in a Bayesian inference with features approaching 
classical cosmological parameter posteriors. 

The PMC approach is, in essence, an iterated impor- 
tance sampling scheme that simultaneously produces, at 
each iteration, a sample approximately simulated from a 
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TABLE IV: Parameter means and 68% confidence intervals 
for PMC using lensing, SNIa and CMB in combination. The 
(non-adaptive) MCMC results correspond to the values given 
in Table 5 from [sj. 
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target distribution tt and an approximation of tt in the 
form of the current proposal distribution. As such, the 
samples produced by the PMC approach can be exploited 
as regular importance sampling outputs at any iteration 
t. Samples from previous iterations can be combined [25| . 
and approximations like 7r(/) can be updated dynami- 
cally, without necessarily requiring the storage of sam- 
ples. 

Although adaptation of the importance function has 
the explicit aim of improving the coverage of the posterior 
density there are instances where this objective may not 
be met. In some cases, successive updates of the impor- 
tance function may result in: (a) an importance function 
which is too peaked and which has light tails (invalid im- 
portance function); (b) an importance function which fits 
only one mode (in the case of a multimodal posterior); 
(c) numerical problems due to the adaptation procedure 
(usually involving poor conditioning of some of the co- 
variance matrices). Such cases are likely to produce a 
poor approximation to the integral of interest, or alter- 
natively lead to highly variable parameter estimates over 
iterations. These problems can be quickly discovered or 
signalled by observing a poor ESS, and parameter es- 
timates or normalised perplexity which do not stabilise 
after a few iterations. 

Such cases of poor performance as outlined can be sig- 
nificantly reduced by choosing a reasonably well informed 
initial importance function with a large enough sample 



size at each iteration, especially on the initial iteration 
that requires many points to counter- weight a potentially 
poor importance function. In general, the initial impor- 
tance function should be chosen to cover a region of the 
parameter space that has support larger than the poste- 
rior. In the absence of reliable prior information, finding 
such an importance function may be difficult to do. One 
approach may be to locate the components in the centre 
of the feasible range (if available) for each variable, with 
reasonably large variances to ensure some coverage of the 
parameter space. We found this approach to be reason- 
ably successful for the simulated data case discussed in 
Sect, mil In the presence of some prior information, for 
example an estimate of the maximum-likelihood point 
and an approximation of the covariance matrix (via the 
Hessian), components can be placed around these points 
with variance comparable to the approximation. Another 
approach may be to perform a singular value decomposi- 
tion of the covariance matrix, and make use of the eigen- 
vectors and eigenvalues to place components along the 
most likely directions of interest. Alternatively and in 
the same spirit, components can be placed according to 
the principal points of the resulting sample, using a k- 
means clustering approach [s^ . Both approaches have 
been reasonably successful for a range of posterior den- 
sities examined, and by placing components in regions 
of high posterior support in addition to the mode have 
the potential to further significantly reduce the number 
of iterations for difficult posterior densities. 

The main appeals or advantages of the PMC method 
are worth re-emphasising at this point: 

1. Parallelisation of the posterior calculations 

2. Low variance of Monte Carlo estimates 

3. Simple diagnostics of 'convergence' (perplexity) 

We address these three points in more detail now. 

(1.) The first advantage, namely the ability to par- 
allelise the computational task, is becoming increasingly 
useful through the availability of cheap multi-CPU com- 
puters and the standardization of clusters of computers. 
Software to implement the parallelisation task, such as 
Message Parsing Interface (MPI) ^, are publicly avail- 
able and relatively straightforward to implement. For 
the cosmological examples presented (Sect. ITV)) . we used 
up to 100 CPUs on a computer cluster to explore the cos- 
mology posteriors. In the case of WMAP5, this reduced 
the computational time from several days for MCMC to 
a few hours using PMC. 

(2.) In general, for PMC and an importance function 
that is closely matched to the target density, significant 
reductions in the variance of the Monte Carlo estimates 
are possible in comparison to estimates obtained using 



http://www-unix.mcs.anl. gov /mpi/ 
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MCMC [1]. For example, for the posterfor estimates for 
the WMAP5 data we observed a 10-fold reduction in vari- 
ance for the same number of sample points as observed for 
MCMC. Such reductions suggest that the computational 
time savings extend not only to the number of CPU's 
available but to smaller sample requirements for PMC in 
total compared to MCMC to achieve similar variability 
of estimates. For cosmological applications, this observa- 
tion is valuable as we observed, e.g., in Fig. [6] for CMB 
data, that the fit between the adapted importance func- 
tion and the target posterior is sometimes quite good. 
By combining samples across iterations further compu- 
tational savings are also possible. The absence of con- 
struction of a Markov chain for PMC can also have the 
desirable attribute of reducing sample noise, as observed 
for the SNIa data in Sect. lIVD2l 

(3.) As shown in Sect. Ill C 21 the perplexity (eq. [TS]) 
is a relatively simple measure of sampling adequacy to 
the target density of interest. For MCMC and other ap- 
proaches which rely on formal measures of convergence, 
assessment of convergence can be very difficult with users 
facing a potential array of associated diagnostic tests. 

In addition to the above points, a further appeal of 
PMC is the ability to provide a very good approximation 
to the marginal posterior or evidence, which naturally 
follows as a byproduct of the approach. To demonstrate 
this appeal, further research is underway to explore the 
use of PMC in the context of model selection problems 
in cosmology. 
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APPENDIX A: DETAILS OF THE IMPORTANCE 
FUNCTION UPDATES FOR PMC 

The method proposed in [l6j to adaptively update the 
parameters of the importance function g* is based on 
a variant of the EM (Expectation-Maximization) algo- 
rithm (53j . which is the standard tool for the estimation 
of the parameters of mixture densities. We describe be- 
low the principle underlying the algorithm of (l6| , show- 
ing in particular that each iteration decreases, up to the 
importance sampling approximation errors, the KuUback 
divergence between the target tt and the importance func- 
tion g*. 



Remember that our goal is to minimise (|T0)) . as t in- 
creases, by iteratively tuning the parameters a*, 9* of the 
mixture importance function defined in pT|) . Developing 
the logarithm in (fTU]) . this objective can be equivalcntly 
formulated in terms of the maximization of the following 
quantity 

i{a, 0)= J log aMx; Od)^ <x) dx (Al) 



with respect to a, 9. Using Baycs' rule, we denote by 



Pdix;a,0) 



T.d=l^dV{x\0d) 



(A2) 



the posterior probability that x belongs to the dth com- 
ponent of the mixture (for the mixture parameters a, 9). 
The EM principle consists of evaluating, at iteration <, 
the following intermediate quantity 



LHa. 



D 



^ Pd{x; a*, 6'*)log {ad^{x\ 0d)) ti{x) dx. 



(A3) 

Using the concavity of the log as well as the expression 
of pd in (jA2[) . it is easily checked that 



D 

E 



Pd{x]a\0*^)\oi 



ad^jx; 0d) 
a\if{x;0\) 



T,d=i^dVix;0d) 



< log 



(A4) 



and hence that L\a, 0) ~ L\a\0^) < ^(a, 0) - i{a\0*). 
Thus, any value of a, which increases the intermediate 
quantity i* above the level L*(a*,6'*) also results in, at 
least, an equivalent increase of the actual objective func- 
tion i. In the EM algorithm, one sets a*"'"^ and 0*~^^ 
to the values where the intermediate quantity L^{a, 0) is 
maximal, thus satisfying the previous requirement. Fur- 
thermore, the maximization of L*(a, 0) leads to a closed 
form solution whenever ip belongs to a so-called exponen- 
tial family of probability densities. 

In the example of the multivariate Gaussian density 
recalled in p^ . the parameter 0d consists of the mean 
Pd and the covariance matrix and the intermediate 
quantity may be written as 

L*(a,^,S)= / ^pd{x\a\p\Y.^)\ log(ad) 

d=l ^ 

- i [log iSdl + (x - Pd^^^d^x - Pd)] |^(a:) dx, (A5) 

up to terms that do not depend on a, p or E. Routine 
calculations show that the maximum of (IA5I) is achieved 
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for 



where 



t+i _ / xpd(x;a*,/i*,S]*)7r(a;)dx 
— t+i ' 

,t+i 



(A6) 
(A7) 



J{x - ^i'+'){x - tj'+TPdix- a',fi\ E*)7r(a;)dx 



(A8) 



Pd{x\a,0) 



adT{x;vd,Pd,^d) 

Yjd=l <^dT{x] Vd, Pd, Sd) 



(A9) 



In practice, both the numerator and denominator of each 
of the above expressions are integrals under tt which 
must be approximated. The solution proposed in [l3| 
is based on self-normalised importance sampling using 
the weighted sample simulated at the previous iteration 
(x* , wj), . . . , (a:^, w%). The corresponding empirical up- 
date equations are given in eqs. (fT4|) -(fT6 l) of Sect. Ill CTI 
The Student-t distribution provides a family of mul- 
tivariate densities with parameters fi and E which have 
the same interpretation as in the Gaussian case (except 
for the fact that the covariance is equal to ly/iiy — 2)E 
rater than S) but with an additional shape factor > 2 
which allow for heavier tails: letting — > oo yields back 
the Gaussian but for = 2, one obtains a density with 
polynomially decreasing tails whose only finite moments 
are the two first ones (note that it is also possible to ex- 
tend the family to the case where < < 2). Using 
mixtures of Student-t distributions will thus be mostly 
useful in cases where the target posterior distribution tt 
itself has heavy tails. The parameter update correspond- 
ing to mixtures of Student-t distributions is a bit more 
involved but follows the same general pattern. For the 
sake of completeness, we just recall below the formulas 
given in p^ : 



N 

E 



WnPd{xl;a',9'), 



t+1 _ J2n=i KPd{xi;a\ O^hdjxi; 9^)xl 

Pd 



^d — 



N 



J2n=iKPdixi;at,0*) 



^<p,(x^;a*,0*)7d(4;0*)(x:,-/i*+i)(x:,-M^+i)^ 

Tl = l 



ld{xi;9) = 



i^d+p 



i^d + {x - PdV i^d) ^{x-fid} 



(AlO) 



and t{-] fijTi,!^) denotes the p-dimensional Student-t 
probability density function. 



T{x;fi,T.,iy) = 



r(://2)i/P/27rP/2 



1 + -{x - p)'-j:-'{x - fi) 



(All) 



Sampling from a multivariate Student-t distribution is 
most easily undertaken by using its derivation in terms of 
a multivariate Gaussian (Y ~ Nk(0, S)) and chi-squared 
distribution {Z ~ xt)i 



and taking advantage of the fact that sampling from Y 
and Z is straightforward. 
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